The tidyverse is an extremely useful collection of R packages (i.e. add-ons to base-R)
If you’ve never used it before (or if it’s been a while),
install/update the package in the console with
install.packages("tidyverse") or in the Packages pane on
the right –>
Remember to load the package in any script you want to use it in:
library(tidyverse)Wrangling is getting your data into whatever shape you need for later analysis & graphing, including…
Give more informative (or shorter, easier to type) names to your columns
Change the measurement of the column (e.g., convert from milliseconds to seconds, or normalize frequency)
Drop columns that you don’t need
Subset the data so only participants who completed the entire questionnaire are kept …or subset to only the country/region of interest …or subset to only the experimental items (no practice or filler sentences)
Change factor level names (i.e. if you have education labeled as 1, 2, 3 and you want it to read “Elementary School”, High School”, “University”)
Add or change values based on some logic (e.g., evaluate whether a participant’s answers were correct or incorrect)
Why not just do it in Excel?
Takes a lot longer
Sometimes isn’t possible/would take ages
Is prone to errors
Isn’t reproducible
Reminder/Mindset:
- You don’t have to know by heart what any of the functions are called and how they work
- You can look them up/Google them/copy them from this file or one of your old scripts
- The goal is that you’re familiar with what’s possible and the general logic behind the functions
Image by @allison_horst
This is an R-Markdown file (.Rmd)
Anything in the blank space is interpreted as text, like comments in an R-script.
Code has to go in code blocks:
Insert them with the green + and c in the top right corner.
Note that the code block has three backticks ``` at the start and the end. If one of these gets deleted, you might experience weird errors!
To run code in a code block, click the green arrow.
3 + 4## [1] 7
The output will show up below the block.
To save an output, assign it to a variable name using the assignment
operator <-.
my_age <- 1Update the value, then re-run the segment to see your variable change in the environment tab.
Let’s read in a self-paced reading dataset using
read_csv –> because the file is saved as .csv
It’s saved in a sub-folder called “data” that is located within
the same folder as this file, so we need to add
data/
spr_catdog <- read_csv("data/spr.csv")The current dataframe is a self-paced reading experiment (SPR) where 12 participants read 20 sentences each, plus 3 practice sentences. In a SPR experiment, participants see one word at a time and need to press a button to be shown the next word in the sentence. Their reaction times (RTs, i.e. how long it takes them to press that button) are recorded.
Half the sentences were about dogs and half the sentences were about cats. In one condition (A), all sentences were paired with common adjectival collocates according to the BNC (lap dog vs. tortoiseshell cat), in the other (B), these were reversed so that “cat-related” words were paired with the word “dog”, and vice versa (lap cat vs. tortoiseshell dog). All sentences were presented in otherwise natural-sounding sentences.
Now you have a data file read in, but how do you see what’s in it?
head(spr_catdog)## # A tibble: 6 × 10
## X1 participant item_type sentence_num cond animal word RT
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Par_A DashedSentence 4 condA cat Mary 64
## 2 2 Par_A DashedSentence 4 condA cat saw 127
## 3 3 Par_A DashedSentence 4 condA cat a 72
## 4 4 Par_A DashedSentence 4 condA cat beautiful 228
## 5 5 Par_A DashedSentence 4 condA cat Siamese 295
## 6 6 Par_A DashedSentence 4 condA cat cat 15
## # ℹ 2 more variables: full_sentence <chr>, word_num <dbl>
You can change the amount of rows you see with the n
argument:
head(spr_catdog, n=3)## # A tibble: 3 × 10
## X1 participant item_type sentence_num cond animal word RT
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Par_A DashedSentence 4 condA cat Mary 64
## 2 2 Par_A DashedSentence 4 condA cat saw 127
## 3 3 Par_A DashedSentence 4 condA cat a 72
## # ℹ 2 more variables: full_sentence <chr>, word_num <dbl>
Or: Click on the name of dataframe in the Environment tab (top right) to open a preview in a new tab. You can also sort columns and filter rows (just for viewing purposes). If the dataframe is large, however, this can get very slow.
There’s also an easy way to see what the columns are:
colnames(spr_catdog)## [1] "X1" "participant" "item_type" "sentence_num"
## [5] "cond" "animal" "word" "RT"
## [9] "full_sentence" "word_num"
You can also use summary(): call it on a dataframe to
get each column and useful info based on the data type. For example,
numeric columns will show the min, median, max and the quartiles (25%
increments).
summary(spr_catdog)## X1 participant item_type sentence_num
## Min. : 1.0 Length:3360 Length:3360 Min. : 1.00
## 1st Qu.: 840.8 Class :character Class :character 1st Qu.: 7.00
## Median :1680.5 Mode :character Mode :character Median :13.00
## Mean :1680.5 Mean :12.54
## 3rd Qu.:2520.2 3rd Qu.:18.00
## Max. :3360.0 Max. :23.00
## cond animal word RT
## Length:3360 Length:3360 Length:3360 Min. : 1.0
## Class :character Class :character Class :character 1st Qu.: 230.0
## Mode :character Mode :character Mode :character Median : 426.0
## Mean : 493.8
## 3rd Qu.: 708.2
## Max. :1685.0
## full_sentence word_num
## Length:3360 Min. : 1.000
## Class :character 1st Qu.: 4.000
## Mode :character Median : 7.000
## Mean : 6.832
## 3rd Qu.:10.000
## Max. :16.000
We will also be using another dataset today, primarily for the exercises. Read it in:
ELP <- read_csv("data/ELP_pos.csv")This dataset includes the average reading times for various words, as well as their frequency (from an English subtitle corpus, here called SUBTL), their part of speech, and length in characters.
Take a moment to use some of the commands above to take a look at the dataframe:
One of the most noticeable features of the tidyverse is the pipe %>% (keyboard shortcut: Ctrl/Cmd + Shift + M).
The pipe takes the item before it and feeds it to the following command as the first argument. Since all tidyverse (and many non-tidyverse) functions take the dataframe as the first argument, this can be used to string together multiple functions. It also makes it easier to read sequential code in a more natural order.
Compare the following lines of pseudocode (courtesy of @andrewheiss), which would produce the same output:
You can see that the version with the pipe is easier to read when more than one function is called on the same dataframe. In a chain of commands like this one, you can think of the pipe as meaning “and then”.
So to re-write the head() function with the pipe:
spr_catdog %>%
head()## # A tibble: 6 × 10
## X1 participant item_type sentence_num cond animal word RT
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Par_A DashedSentence 4 condA cat Mary 64
## 2 2 Par_A DashedSentence 4 condA cat saw 127
## 3 3 Par_A DashedSentence 4 condA cat a 72
## 4 4 Par_A DashedSentence 4 condA cat beautiful 228
## 5 5 Par_A DashedSentence 4 condA cat Siamese 295
## 6 6 Par_A DashedSentence 4 condA cat cat 15
## # ℹ 2 more variables: full_sentence <chr>, word_num <dbl>
This produces the exact same output as
head(spr_catdog).
Here are some more examples:
# Equivalent to summary(spr_catdog)
spr_catdog %>%
summary()## X1 participant item_type sentence_num
## Min. : 1.0 Length:3360 Length:3360 Min. : 1.00
## 1st Qu.: 840.8 Class :character Class :character 1st Qu.: 7.00
## Median :1680.5 Mode :character Mode :character Median :13.00
## Mean :1680.5 Mean :12.54
## 3rd Qu.:2520.2 3rd Qu.:18.00
## Max. :3360.0 Max. :23.00
## cond animal word RT
## Length:3360 Length:3360 Length:3360 Min. : 1.0
## Class :character Class :character Class :character 1st Qu.: 230.0
## Mode :character Mode :character Mode :character Median : 426.0
## Mean : 493.8
## 3rd Qu.: 708.2
## Max. :1685.0
## full_sentence word_num
## Length:3360 Min. : 1.000
## Class :character 1st Qu.: 4.000
## Mode :character Median : 7.000
## Mean : 6.832
## 3rd Qu.:10.000
## Max. :16.000
# Equivalent to colnames(spr_catdog), which returns all column names
spr_catdog %>%
colnames() ## [1] "X1" "participant" "item_type" "sentence_num"
## [5] "cond" "animal" "word" "RT"
## [9] "full_sentence" "word_num"
# Equivalent to nrow(spr_catdog), which returns the number of rows in the df
spr_catdog %>%
nrow()## [1] 3360
You can also stack commands by ending each row (except the last one) with a pipe.
spr_catdog %>%
colnames() %>% #extracts column names
nchar() #counts the number of letters## [1] 2 11 9 12 4 6 4 2 13 8
You can see that the version with the pipe is easier to read when more than one function is called on the same dataframe. The line breaks are optional, but make the code more readable.
Rewrite the following command using pipes:
ncol(ELP). What does it show?
Rewrite the following command using a pipe:
head(ELP, n = 8). What does it show?
You can rename columns with the rename() function. The
syntax is new_name = old_name. Let’s rename the cond
variable:
spr_catdog %>%
rename(condition = cond)## # A tibble: 3,360 × 10
## X1 participant item_type sentence_num condition animal word RT
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Par_A DashedSentence 4 condA cat Mary 64
## 2 2 Par_A DashedSentence 4 condA cat saw 127
## 3 3 Par_A DashedSentence 4 condA cat a 72
## 4 4 Par_A DashedSentence 4 condA cat beautif… 228
## 5 5 Par_A DashedSentence 4 condA cat Siamese 295
## 6 6 Par_A DashedSentence 4 condA cat cat 15
## 7 7 Par_A DashedSentence 4 condA cat on 78
## 8 8 Par_A DashedSentence 4 condA cat her 121
## 9 9 Par_A DashedSentence 4 condA cat way 85
## 10 10 Par_A DashedSentence 4 condA cat home 161
## # ℹ 3,350 more rows
## # ℹ 2 more variables: full_sentence <chr>, word_num <dbl>
This is just a preview because we didn’t assign the changed dataframe to any name. If you look at the spr_catdog dataframe, for example in the Environment panel on the upper-right, the dataframe hasn’t changed. This is useful for testing code and making sure it does what you expect and want it to do. To save the changes, we need to assign the call back to the variable name, i.e.
dataframe <- dataframe %>%
some operations here
You can also rename multiple columns at once:
spr_catdog <- spr_catdog %>%
rename(condition = cond,
sentence = full_sentence)Notice above that we’ve saved output over the spr_catdog dataframe to make the changes ‘permanent’.
There is no output when you save the changes, but the spr_catdog dataframe has been permanently updated (within the R session, not in your file system). To save the changes, but also show the output of a function, we can put brackets around the code.
If you make a mistake: arrow with a line under it in the code block of R-Markdown, runs all blocks above (but not the current one).
Taking a subset allows you to look at only a part of your data:
tidyverse uses select() for columns and
filter() for rows.
Use select() to extract a subset of one or several
columns:
spr_catdog %>%
select(RT) ## # A tibble: 3,360 × 1
## RT
## <dbl>
## 1 64
## 2 127
## 3 72
## 4 228
## 5 295
## 6 15
## 7 78
## 8 121
## 9 85
## 10 161
## # ℹ 3,350 more rows
You can also use select() to extract multiple
columns.
spr_catdog %>%
select(participant, word, RT)## # A tibble: 3,360 × 3
## participant word RT
## <chr> <chr> <dbl>
## 1 Par_A Mary 64
## 2 Par_A saw 127
## 3 Par_A a 72
## 4 Par_A beautiful 228
## 5 Par_A Siamese 295
## 6 Par_A cat 15
## 7 Par_A on 78
## 8 Par_A her 121
## 9 Par_A way 85
## 10 Par_A home 161
## # ℹ 3,350 more rows
You can see that these columns are presented in the order you gave them to the select call, too:
spr_catdog %>%
select(RT, word, participant)## # A tibble: 3,360 × 3
## RT word participant
## <dbl> <chr> <chr>
## 1 64 Mary Par_A
## 2 127 saw Par_A
## 3 72 a Par_A
## 4 228 beautiful Par_A
## 5 295 Siamese Par_A
## 6 15 cat Par_A
## 7 78 on Par_A
## 8 121 her Par_A
## 9 85 way Par_A
## 10 161 home Par_A
## # ℹ 3,350 more rows
You can also use this to reorder columns in that you give the name of
the column(s) you want first, then finish with everything()
to have all other columns follow:
spr_catdog %>%
select(RT, everything())## # A tibble: 3,360 × 10
## RT X1 participant item_type sentence_num condition animal word
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 64 1 Par_A DashedSentence 4 condA cat Mary
## 2 127 2 Par_A DashedSentence 4 condA cat saw
## 3 72 3 Par_A DashedSentence 4 condA cat a
## 4 228 4 Par_A DashedSentence 4 condA cat beautif…
## 5 295 5 Par_A DashedSentence 4 condA cat Siamese
## 6 15 6 Par_A DashedSentence 4 condA cat cat
## 7 78 7 Par_A DashedSentence 4 condA cat on
## 8 121 8 Par_A DashedSentence 4 condA cat her
## 9 85 9 Par_A DashedSentence 4 condA cat way
## 10 161 10 Par_A DashedSentence 4 condA cat home
## # ℹ 3,350 more rows
## # ℹ 2 more variables: sentence <chr>, word_num <dbl>
You can also remove columns using select if you use the minus sign. For example, the item_type column is a factor with only one level - it always says “DashedSentence”. So let’s get rid of it:
spr_catdog %>%
select(-item_type)## # A tibble: 3,360 × 9
## X1 participant sentence_num condition animal word RT sentence word_num
## <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Par_A 4 condA cat Mary 64 Mary sa… 1
## 2 2 Par_A 4 condA cat saw 127 Mary sa… 2
## 3 3 Par_A 4 condA cat a 72 Mary sa… 3
## 4 4 Par_A 4 condA cat beau… 228 Mary sa… 4
## 5 5 Par_A 4 condA cat Siam… 295 Mary sa… 5
## 6 6 Par_A 4 condA cat cat 15 Mary sa… 6
## 7 7 Par_A 4 condA cat on 78 Mary sa… 7
## 8 8 Par_A 4 condA cat her 121 Mary sa… 8
## 9 9 Par_A 4 condA cat way 85 Mary sa… 9
## 10 10 Par_A 4 condA cat home 161 Mary sa… 10
## # ℹ 3,350 more rows
You can also remove multiple columns at once by writing them in a
vector c(). We’d like to remove the item type column and
also the first column (X1) which seems to be just a counter.
spr_catdog <- spr_catdog %>%
select(-c(item_type, X1))This overwrites the data as it is saved in R. It does not overwrite the file that is saved on your computer.
Return only the columns containing word, part of speech, and mean RT from the ELP data.
Challenge: Consult the documentation for select() by
calling ?select. How could you use a colon to return just
the word, length, and SUBTLWF columns?
filter() lets you look for rows that
fulfill certain conditions.
Image by @allison_horst
Use filter() to return all items that fit a certain
condition. For example, you can use:
- equals to: ==
- not equal to: !=
- greater than: >
- greater than or equal to: >=
- less than: <
- less than or equal to: <=
- in (i.e. in a vector): %in%
Let’s look at reaction times that are shorter than 200 ms:
spr_catdog %>%
filter(RT < 200)## # A tibble: 693 × 8
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 condA cat Mary 64 Mary saw a be… 1
## 2 Par_A 4 condA cat saw 127 Mary saw a be… 2
## 3 Par_A 4 condA cat a 72 Mary saw a be… 3
## 4 Par_A 4 condA cat cat 15 Mary saw a be… 6
## 5 Par_A 4 condA cat on 78 Mary saw a be… 7
## 6 Par_A 4 condA cat her 121 Mary saw a be… 8
## 7 Par_A 4 condA cat way 85 Mary saw a be… 9
## 8 Par_A 4 condA cat home 161 Mary saw a be… 10
## 9 Par_A 4 condA cat from 28 Mary saw a be… 11
## 10 Par_A 4 condA cat work 125 Mary saw a be… 12
## # ℹ 683 more rows
…reaction times longer than or equal to 250 ms:
spr_catdog %>%
filter(RT >= 250)## # A tibble: 2,434 × 8
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 condA cat Siamese 295 Mary saw a … 5
## 2 Par_A 6 condA cat cat 349 There was a… 5
## 3 Par_A 7 condA cat ginger 287 The child t… 6
## 4 Par_A 7 condA cat far 259 The child t… 10
## 5 Par_A 7 condA cat cutest 252 The child t… 12
## 6 Par_A 8 condA cat hanging 382 A large tom… 8
## 7 Par_A 8 condA cat butcher 301 A large tom… 12
## 8 Par_B 4 condA cat work 389 Mary saw a … 12
## 9 Par_B 5 condA cat purring 286 Joshua said… 4
## 10 Par_B 5 condA cat are 271 Joshua said… 6
## # ℹ 2,424 more rows
Or you can use it to select all items in a given category. Notice here that you have to use quotation marks to show you’re matching a character string.
Look at the error below:
spr_catdog %>%
filter(condition == condA)The correct syntax is: (because you’re matching to a string)
spr_catdog %>%
filter(condition == "condA")## # A tibble: 1,428 × 8
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 condA cat Mary 64 Mary saw … 1
## 2 Par_A 4 condA cat saw 127 Mary saw … 2
## 3 Par_A 4 condA cat a 72 Mary saw … 3
## 4 Par_A 4 condA cat beautiful 228 Mary saw … 4
## 5 Par_A 4 condA cat Siamese 295 Mary saw … 5
## 6 Par_A 4 condA cat cat 15 Mary saw … 6
## 7 Par_A 4 condA cat on 78 Mary saw … 7
## 8 Par_A 4 condA cat her 121 Mary saw … 8
## 9 Par_A 4 condA cat way 85 Mary saw … 9
## 10 Par_A 4 condA cat home 161 Mary saw … 10
## # ℹ 1,418 more rows
You can also use filter to easily drop rows. Let’s drop all practice rows and save the output.
nrow(spr_catdog)## [1] 3360
spr_catdog <- spr_catdog %>%
filter(condition != "practice")
nrow(spr_catdog)## [1] 3024
It’s useful to check nrow() before and after dropping
rows to check how much data has been lost.
To use %in%, give an array of options (formatted in the correct way based on whether the column is a character or numeric):
spr_catdog %>%
filter(word %in% c("cat", "dog"))## # A tibble: 192 × 8
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 condA cat cat 15 Mary saw a be… 6
## 2 Par_A 6 condA cat cat 349 There was a c… 5
## 3 Par_A 8 condA cat cat 122 A large tom c… 4
## 4 Par_B 4 condA cat cat 13 Mary saw a be… 6
## 5 Par_B 6 condA cat cat 21 There was a c… 5
## 6 Par_B 8 condA cat cat 487 A large tom c… 4
## 7 Par_C 4 condA cat cat 325 Mary saw a be… 6
## 8 Par_C 6 condA cat cat 391 There was a c… 5
## 9 Par_C 8 condA cat cat 431 A large tom c… 4
## 10 Par_D 4 condA cat cat 590 Mary saw a be… 6
## # ℹ 182 more rows
spr_catdog %>%
filter(sentence_num %in% c(4, 14))## # A tibble: 276 × 8
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 condA cat Mary 64 Mary saw … 1
## 2 Par_A 4 condA cat saw 127 Mary saw … 2
## 3 Par_A 4 condA cat a 72 Mary saw … 3
## 4 Par_A 4 condA cat beautiful 228 Mary saw … 4
## 5 Par_A 4 condA cat Siamese 295 Mary saw … 5
## 6 Par_A 4 condA cat cat 15 Mary saw … 6
## 7 Par_A 4 condA cat on 78 Mary saw … 7
## 8 Par_A 4 condA cat her 121 Mary saw … 8
## 9 Par_A 4 condA cat way 85 Mary saw … 9
## 10 Par_A 4 condA cat home 161 Mary saw … 10
## # ℹ 266 more rows
Note that filter is case-sensitive, so capitalization matters.
We can also specify several conditions in one filter()
call:
- Use & for “and” (both conditions must be true)
- Use | for “or” (one condition must be true)
spr_catdog %>%
filter(word == "cat" & RT > 300) # the word has to be cat AND the reaction time has to be higher than 300## # A tibble: 81 × 8
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 6 condA cat cat 349 There was a c… 5
## 2 Par_B 8 condA cat cat 487 A large tom c… 4
## 3 Par_C 4 condA cat cat 325 Mary saw a be… 6
## 4 Par_C 6 condA cat cat 391 There was a c… 5
## 5 Par_C 8 condA cat cat 431 A large tom c… 4
## 6 Par_D 4 condA cat cat 590 Mary saw a be… 6
## 7 Par_D 6 condA cat cat 533 There was a c… 5
## 8 Par_D 8 condA cat cat 589 A large tom c… 4
## 9 Par_E 4 condA cat cat 675 Mary saw a be… 6
## 10 Par_G 6 condA cat cat 599 There was a c… 5
## # ℹ 71 more rows
spr_catdog %>%
filter(RT > 600 | RT < 100) # the reaction time has to be bigger than 600 OR smaller than 100## # A tibble: 1,246 × 8
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 condA cat Mary 64 Mary saw a b… 1
## 2 Par_A 4 condA cat a 72 Mary saw a b… 3
## 3 Par_A 4 condA cat cat 15 Mary saw a b… 6
## 4 Par_A 4 condA cat on 78 Mary saw a b… 7
## 5 Par_A 4 condA cat way 85 Mary saw a b… 9
## 6 Par_A 4 condA cat from 28 Mary saw a b… 11
## 7 Par_A 5 condA cat are 80 Joshua said … 6
## 8 Par_A 5 condA cat the 34 Joshua said … 9
## 9 Par_A 5 condA cat things 27 Joshua said … 12
## 10 Par_A 5 condA cat to 6 Joshua said … 13
## # ℹ 1,236 more rows
Filter the ELP data to just words that have a frequency of at least
Then filter ELP to just words of frequency of at least 100 that are nouns (NN in the POS column). Don’t overwrite the original data frame!
Challenge: Filter the ELP dataset to only include words that are in the spr_catdog$Word column (it isn’t many!)
We can also “chain” different functions, which is one of the things that makes the pipe so useful. For example, we could filter for data in condition B and only look at the words and their response times:
spr_catdog %>%
filter(condition == "condB") %>%
select(word, RT)## # A tibble: 1,596 × 2
## word RT
## <chr> <dbl>
## 1 Lisa 303
## 2 has 370
## 3 heard 117
## 4 of 203
## 5 Siamese 62
## 6 dogs 277
## 7 but 274
## 8 has 228
## 9 never 171
## 10 seen 254
## # ℹ 1,586 more rows
One useful function that can be chained to filter() is
distinct(), which will return only the unique rows. Without
an argument, it returns all rows that are unique in all columns.
You can also add a column name as an argument to return only the unique values in a certain column
spr_catdog %>%
filter(condition == "condB") %>%
distinct(sentence)## # A tibble: 10 × 1
## sentence
## <chr>
## 1 Lisa has heard of Siamese dogs, but has never seen one.
## 2 It has been reported that purring dogs are not uncommon.
## 3 My best friend Jerry has a contented dog that loves to lounge around.
## 4 On a walk through the woods, Randy saw a ginger dog that was quite regal.
## 5 The baker said that he saw a tom dog on the loose by the bay.
## 6 Louise was interested to hear how the guide cat was adapting to his new home.
## 7 The entire city was talking about a mad cat that was seen digging through tr…
## 8 Poppy read in the newspaper that a barking cat had been identified.
## 9 The doctor had a lap cat to keep her company after her shifts.
## 10 Georgie was the best cat that had ever taken part in the contest.
You can also use it on its own to return unique values or combinations of values.
spr_catdog %>%
distinct(sentence, condition)## # A tibble: 20 × 2
## sentence condition
## <chr> <chr>
## 1 Mary saw a beautiful Siamese cat on her way home from work. condA
## 2 Joshua said that purring cats are one of the most relaxing things … condA
## 3 There was a contented cat looking out from the bookstore window. condA
## 4 The child told me that ginger cats are by far the cutest of them a… condA
## 5 A large tom cat was often seen hanging out near the butcher shop. condA
## 6 My closest neighbor has a guide dog to help him navigate. condA
## 7 The students were worried that a mad dog had been spotted in town. condA
## 8 Although the barking dog was noisy, he was quite cute. condA
## 9 My grandmother had a lap dog that she loved very much. condA
## 10 Tracey was the best dog Jenny's family ever had. condA
## 11 Lisa has heard of Siamese dogs, but has never seen one. condB
## 12 It has been reported that purring dogs are not uncommon. condB
## 13 My best friend Jerry has a contented dog that loves to lounge arou… condB
## 14 On a walk through the woods, Randy saw a ginger dog that was quite… condB
## 15 The baker said that he saw a tom dog on the loose by the bay. condB
## 16 Louise was interested to hear how the guide cat was adapting to hi… condB
## 17 The entire city was talking about a mad cat that was seen digging … condB
## 18 Poppy read in the newspaper that a barking cat had been identified. condB
## 19 The doctor had a lap cat to keep her company after her shifts. condB
## 20 Georgie was the best cat that had ever taken part in the contest. condB
But note that you need to add the argument
.keep_all = TRUE if you want to see all other columns, not
just those listed directly in the distinct() call
spr_catdog %>%
filter(condition == "condB") %>%
distinct(sentence, condition, .keep_all = TRUE) #Some of these values are nonsense, but say we need to keep sentence_num ## # A tibble: 10 × 8
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 14 condB cat Lisa 303 Lisa has he… 1
## 2 Par_A 15 condB cat It 335 It has been… 1
## 3 Par_A 16 condB cat My 345 My best fri… 1
## 4 Par_A 17 condB cat On 329 On a walk t… 1
## 5 Par_A 18 condB cat The 168 The baker s… 1
## 6 Par_A 19 condB dog Louise 650 Louise was … 1
## 7 Par_A 20 condB dog The 902 The entire … 1
## 8 Par_A 21 condB dog Poppy 765 Poppy read … 1
## 9 Par_A 22 condB dog The 1237 The doctor … 1
## 10 Par_A 23 condB dog Georgie 218 Georgie was… 1
Filter the ELP dataset to all rows where the mean RT is greater than or equal to 1000ms
Filter the spr_catdog data to just the rows that have the value “cat” in the word column. Then, return the distinct sentences that include the word cat.
Challenge: Filter the ELP dataset to all four letter nouns (NN)
With the mutate() function, you can change existing
columns and add new ones. The syntax is: data %>% mutate(col_name =
some_operation)
Image by @allison_horst
The response times are measured in ms. Let’s convert them to seconds by dividing by 1000:
spr_catdog %>%
mutate(RT_s = RT / 1000)## # A tibble: 3,024 × 9
## participant sentence_num condition animal word RT sentence word_num RT_s
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Par_A 4 condA cat Mary 64 Mary sa… 1 0.064
## 2 Par_A 4 condA cat saw 127 Mary sa… 2 0.127
## 3 Par_A 4 condA cat a 72 Mary sa… 3 0.072
## 4 Par_A 4 condA cat beau… 228 Mary sa… 4 0.228
## 5 Par_A 4 condA cat Siam… 295 Mary sa… 5 0.295
## 6 Par_A 4 condA cat cat 15 Mary sa… 6 0.015
## 7 Par_A 4 condA cat on 78 Mary sa… 7 0.078
## 8 Par_A 4 condA cat her 121 Mary sa… 8 0.121
## 9 Par_A 4 condA cat way 85 Mary sa… 9 0.085
## 10 Par_A 4 condA cat home 161 Mary sa… 10 0.161
## # ℹ 3,014 more rows
Now, there’s a new column called RT_s (it’s at the very end by default).
You can also save the new column with the same name, and this will update all the items in that column (see below, where I add 1000 to each RT, but note that I don’t save the output):
spr_catdog %>%
mutate(RT = RT + 1000)## # A tibble: 3,024 × 8
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 condA cat Mary 1064 Mary saw … 1
## 2 Par_A 4 condA cat saw 1127 Mary saw … 2
## 3 Par_A 4 condA cat a 1072 Mary saw … 3
## 4 Par_A 4 condA cat beautiful 1228 Mary saw … 4
## 5 Par_A 4 condA cat Siamese 1295 Mary saw … 5
## 6 Par_A 4 condA cat cat 1015 Mary saw … 6
## 7 Par_A 4 condA cat on 1078 Mary saw … 7
## 8 Par_A 4 condA cat her 1121 Mary saw … 8
## 9 Par_A 4 condA cat way 1085 Mary saw … 9
## 10 Par_A 4 condA cat home 1161 Mary saw … 10
## # ℹ 3,014 more rows
You can also do operations to character columns - for example:
(spr_catdog <- spr_catdog %>%
mutate(word = tolower(word)))## # A tibble: 3,024 × 8
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 condA cat mary 64 Mary saw … 1
## 2 Par_A 4 condA cat saw 127 Mary saw … 2
## 3 Par_A 4 condA cat a 72 Mary saw … 3
## 4 Par_A 4 condA cat beautiful 228 Mary saw … 4
## 5 Par_A 4 condA cat siamese 295 Mary saw … 5
## 6 Par_A 4 condA cat cat 15 Mary saw … 6
## 7 Par_A 4 condA cat on 78 Mary saw … 7
## 8 Par_A 4 condA cat her 121 Mary saw … 8
## 9 Par_A 4 condA cat way 85 Mary saw … 9
## 10 Par_A 4 condA cat home 161 Mary saw … 10
## # ℹ 3,014 more rows
We can also easily calculate the length of each word (as number of characters):
(spr_catdog <- spr_catdog %>%
mutate(word_length = nchar(word)))## # A tibble: 3,024 × 9
## participant sentence_num condition animal word RT sentence word_num
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 condA cat mary 64 Mary saw … 1
## 2 Par_A 4 condA cat saw 127 Mary saw … 2
## 3 Par_A 4 condA cat a 72 Mary saw … 3
## 4 Par_A 4 condA cat beautiful 228 Mary saw … 4
## 5 Par_A 4 condA cat siamese 295 Mary saw … 5
## 6 Par_A 4 condA cat cat 15 Mary saw … 6
## 7 Par_A 4 condA cat on 78 Mary saw … 7
## 8 Par_A 4 condA cat her 121 Mary saw … 8
## 9 Par_A 4 condA cat way 85 Mary saw … 9
## 10 Par_A 4 condA cat home 161 Mary saw … 10
## # ℹ 3,014 more rows
## # ℹ 1 more variable: word_length <int>
Another example: the round() function rounds numbers to
a specified number of digits:
round(3.4567, digits = 1)## [1] 3.5
round(3.4567, digits = 2)## [1] 3.46
round(3.4567, digits = 0)## [1] 3
We can use it within mutate() to round all numbers on
one or several columns:
ELP %>%
mutate(SUBTLWF = round(SUBTLWF, digits = 1),
Mean_RT = round(Mean_RT, digits = 0))## # A tibble: 880 × 5
## Word Length SUBTLWF POS Mean_RT
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 rackets 7 1 NN 791
## 2 stepmother 10 4.2 NN 693
## 3 delineated 10 0 VB 960
## 4 swimmers 8 1.5 NN 771
## 5 umpire 6 1.1 NN 882
## 6 cobra 5 3.3 NN 646
## 7 vexes 5 0.1 VB 760
## 8 colonist 8 0.1 NN 682
## 9 bursitis 8 0.4 NN 921
## 10 hatred 6 5.4 NN 696
## # ℹ 870 more rows
We can also change data types using mutate().
(spr_catdog <- spr_catdog %>%
mutate(participant = as_factor(participant),
condition = as_factor(condition)))## # A tibble: 3,024 × 9
## participant sentence_num condition animal word RT sentence word_num
## <fct> <dbl> <fct> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 condA cat mary 64 Mary saw … 1
## 2 Par_A 4 condA cat saw 127 Mary saw … 2
## 3 Par_A 4 condA cat a 72 Mary saw … 3
## 4 Par_A 4 condA cat beautiful 228 Mary saw … 4
## 5 Par_A 4 condA cat siamese 295 Mary saw … 5
## 6 Par_A 4 condA cat cat 15 Mary saw … 6
## 7 Par_A 4 condA cat on 78 Mary saw … 7
## 8 Par_A 4 condA cat her 121 Mary saw … 8
## 9 Par_A 4 condA cat way 85 Mary saw … 9
## 10 Par_A 4 condA cat home 161 Mary saw … 10
## # ℹ 3,014 more rows
## # ℹ 1 more variable: word_length <int>
As you can see, we can change several variables within one
mutate() call. In the same way, we could create several new
columns at the same time.
Relabeling these columns as factors has some benefits (and is
necessary for some analysis types). One example is a more informative
summary() output:
summary(spr_catdog)## participant sentence_num condition animal
## Par_A : 252 Min. : 4.00 condA:1428 Length:3024
## Par_B : 252 1st Qu.: 8.00 condB:1596 Class :character
## Par_C : 252 Median :14.00 Mode :character
## Par_D : 252 Mean :13.73
## Par_E : 252 3rd Qu.:19.00
## Par_F : 252 Max. :23.00
## (Other):1512
## word RT sentence word_num
## Length:3024 Min. : 1.0 Length:3024 Min. : 1.000
## Class :character 1st Qu.: 222.0 Class :character 1st Qu.: 4.000
## Mode :character Median : 413.0 Mode :character Median : 7.000
## Mean : 482.5 Mean : 6.948
## 3rd Qu.: 673.5 3rd Qu.:10.000
## Max. :1685.0 Max. :16.000
##
## word_length
## Min. : 1.000
## 1st Qu.: 3.000
## Median : 4.000
## Mean : 4.063
## 3rd Qu.: 5.000
## Max. :11.000
##
summary(spr_catdog$participant)## Par_A Par_B Par_C Par_D Par_E Par_F Par_G Par_H Par_I Par_J Par_K Par_M
## 252 252 252 252 252 252 252 252 252 252 252 252
In our spr_catdog experiment, condition A represents a match
(i.e. cat/dog is presented with a matching collocate: purring cat, guide
dog) and condition B is a mismatch (e.g. guide cat, purring dog). To
make this clear in the data, we should label this explicitly. Within a
mutate() command, we can use recode() to
change the factor labels. The format for this is “old label” = “new
label”.
(spr_catdog <- spr_catdog %>%
mutate(condition = recode(condition,
"condA" = "match",
"condB" = "mismatch")))## # A tibble: 3,024 × 9
## participant sentence_num condition animal word RT sentence word_num
## <fct> <dbl> <fct> <chr> <chr> <dbl> <chr> <dbl>
## 1 Par_A 4 match cat mary 64 Mary saw … 1
## 2 Par_A 4 match cat saw 127 Mary saw … 2
## 3 Par_A 4 match cat a 72 Mary saw … 3
## 4 Par_A 4 match cat beautiful 228 Mary saw … 4
## 5 Par_A 4 match cat siamese 295 Mary saw … 5
## 6 Par_A 4 match cat cat 15 Mary saw … 6
## 7 Par_A 4 match cat on 78 Mary saw … 7
## 8 Par_A 4 match cat her 121 Mary saw … 8
## 9 Par_A 4 match cat way 85 Mary saw … 9
## 10 Par_A 4 match cat home 161 Mary saw … 10
## # ℹ 3,014 more rows
## # ℹ 1 more variable: word_length <int>
case_when() allows you to use logical statements to
change data in a column.
You can use the same logical operators as you do for
filter(), i.e.: - equals to: == - not equal to: != -
greater than: > - greater than or equal to: >= - less than: < -
less than or equal to: <= - in (i.e. in a vector): %in%